23 research outputs found

    Predicting unstable software benchmarks using static source code features

    Software benchmarks are only as good as the performance measurements they yield. Unstable benchmarks show high variability among repeated measurements, which causes uncertainty about the actual performance and complicates reliable change assessment. However, whether a benchmark is stable or unstable only becomes evident after it has been executed and its results are available. In this paper, we introduce a machine-learning-based approach to predict a benchmark’s stability without having to execute it. Our approach relies on 58 statically computed source code features, extracted for benchmark code and code called by a benchmark, related to (1) meta information, e.g., lines of code (LOC), (2) programming language elements, e.g., conditionals or loops, and (3) potentially performance-impacting standard library calls, e.g., file and network input/output (I/O). To assess our approach’s effectiveness, we perform a large-scale experiment on 4,461 Go benchmarks coming from 230 open-source software (OSS) projects. First, we assess the prediction performance of our machine learning models using 11 binary classification algorithms. We find that Random Forest performs best with good prediction performance from 0.79 to 0.90, and 0.43 to 0.68, in terms of AUC and MCC, respectively. Second, we perform feature importance analyses for individual features and feature categories. We find that 7 features related to meta-information, slice usage, nested loops, and synchronization application programming interfaces (APIs) are individually important for good predictions; and that the combination of all features of the called source code is paramount for our model, while the combination of features of the benchmark itself is less important. Our results show that although benchmark stability is affected by more than just the source code, we can effectively utilize machine learning models to predict whether a benchmark will be stable or not ahead of execution.
This enables spending precious testing time on reliable benchmarks, helping developers identify unstable benchmarks during development, allowing unstable benchmarks to be repeated more often, estimating stability in scenarios where repeated benchmark execution is infeasible or impossible, and warning developers if new benchmarks or existing benchmarks executed in new environments will be unstable.
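
The abstract above reports prediction performance in terms of MCC (Matthews correlation coefficient). As a reminder of what that metric measures, here is a minimal sketch that computes it from the four confusion-matrix counts of a binary stable/unstable classifier. The function name `mcc` and the example counts are illustrative assumptions, not values from the paper.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) over 0 (no better than chance)
    to +1 (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical counts, where "positive" means "predicted unstable".
    print(round(mcc(tp=40, fp=10, tn=45, fn=5), 3))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it a sensible choice when stable and unstable benchmarks are not equally frequent.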

    Multi-Objective Search-Based Software Microbenchmark Prioritization

    Ensuring that software performance does not degrade after a code change is paramount. A potential solution, particularly for libraries and frameworks, is regularly executing software microbenchmarks, a performance testing technique similar to (functional) unit tests. This often becomes infeasible due to the extensive runtimes of microbenchmark suites, however. To address that challenge, research has investigated regression testing techniques, such as test case prioritization (TCP), which reorder the execution within a microbenchmark suite to detect larger performance changes sooner. Such techniques are either designed for unit tests and perform sub-par on microbenchmarks or require complex performance models, reducing their potential application drastically. In this paper, we propose a search-based technique based on multi-objective evolutionary algorithms (MOEAs) to improve the current state of microbenchmark prioritization. The technique utilizes three objectives, i.e., coverage to maximize, coverage overlap to minimize, and historical performance change detection to maximize. We find that our technique improves over the best coverage-based, greedy baselines in terms of average percentage of fault-detection on performance (APFD-P) and Top-3 effectiveness by 26 percentage points (pp) and 43 pp (for Additional) and by 17 pp and 32 pp (for Total), reaching 0.77 and 0.24, respectively. Employing the Indicator-Based Evolutionary Algorithm (IBEA) as MOEA leads to the best effectiveness among six MOEAs. Finally, the technique's runtime overhead is acceptable at 19% of the overall benchmark suite runtime, if we consider the enormous runtimes often spanning multiple hours. The added overhead compared to the greedy baselines is minuscule at 1%. These results mark a step forward for universally applicable performance regression testing techniques.

    Software Microbenchmarking in the Cloud. How Bad is it Really?

    Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners might not have access, knowledge, or funds to operate dedicated performance-testing hardware, making public clouds an attractive alternative. However, shared public cloud environments are inherently unpredictable in terms of the system performance they provide. In this study, we explore the effects of cloud environments on the variability of performance test results and to what extent slowdowns can still be reliably detected even in a public cloud. We focus on software microbenchmarks as an example of performance tests and execute extensive experiments on three different well-known public cloud services (AWS, GCE, and Azure) using three different cloud instance types per service. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 4.5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types (with coefficients of variation ranging from 0.03% to > 100%). However, executing test and control experiments on the same instances (in randomized order) allows us to detect slowdowns of 10% or less with high confidence, using state-of-the-art statistical tests (i.e., Wilcoxon rank-sum and overlapping bootstrapped confidence intervals). Finally, our results indicate that Wilcoxon rank-sum manages to detect smaller slowdowns in cloud environments.
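
Two of the statistics named in the abstract above can be illustrated with the Python standard library alone: the coefficient of variation as a variability measure, and a check for non-overlapping bootstrapped confidence intervals as a change detector. This is a minimal sketch under assumptions: the function names, the 99% confidence level, the number of resamples, and the sample data are all hypothetical and not taken from the study.

```python
import random
import statistics


def coefficient_of_variation(samples):
    """Relative variability in percent: sample standard deviation / mean."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100.0


def bootstrap_ci(samples, level=0.99, resamples=1000, seed=42):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    means = sorted(
        statistics.mean(rng.choices(samples, k=len(samples)))
        for _ in range(resamples)
    )
    lower = means[int((1 - level) / 2 * resamples)]
    upper = means[int((1 + level) / 2 * resamples) - 1]
    return lower, upper


def change_detected(control, test, **kwargs):
    """Flag a change when the two confidence intervals do not overlap."""
    c_lo, c_hi = bootstrap_ci(control, **kwargs)
    t_lo, t_hi = bootstrap_ci(test, **kwargs)
    return t_lo > c_hi or c_lo > t_hi


if __name__ == "__main__":
    # Hypothetical execution times (ns/op) of one benchmark on one instance.
    control = [100.0, 101.0, 99.0, 100.5, 99.5]
    slower = [t * 1.10 for t in control]  # simulated 10% slowdown
    print(coefficient_of_variation(control))
    print(change_detected(control, slower))
```

Running control and test measurements on the same instance, as the abstract describes, matters here because both intervals then share the same environmental noise.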

    Applying test case prioritization to software microbenchmarks

    Regression testing comprises techniques which are applied during software evolution to uncover faults effectively and efficiently. While regression testing is widely studied for functional tests, performance regression testing, e.g., with software microbenchmarks, is hardly investigated. Applying test case prioritization (TCP), a regression testing technique, to software microbenchmarks may help capture large performance regressions sooner upon new versions. This may especially be beneficial for microbenchmark suites, because they take considerably longer to execute than unit test suites. However, it is unclear whether traditional unit testing TCP techniques work equally well for software microbenchmarks. In this paper, we empirically study coverage-based TCP techniques, employing total and additional greedy strategies, applied to software microbenchmarks along multiple parameterization dimensions, leading to 54 unique technique instantiations. We find that TCP techniques have a mean APFD-P (average percentage of fault-detection on performance) effectiveness between 0.54 and 0.71 and are able to capture the three largest performance changes after executing 29% to 66% of the whole microbenchmark suite. Our efficiency analysis reveals that the runtime overhead of TCP varies considerably depending on the exact parameterization. The most effective technique has an overhead of 11% of the total microbenchmark suite execution time, making TCP a viable option for performance regression testing. The results demonstrate that the total strategy is superior to the additional strategy. Finally, dynamic-coverage techniques should be favored over static-coverage techniques due to their acceptable analysis overhead; however, in settings where the time for prioritization is limited, static-coverage techniques provide an attractive alternative.
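
The two greedy strategies named in the abstract above, total and additional, can be sketched in a few lines. The sketch below assumes coverage is given as a mapping from benchmark name to the set of covered code units; the benchmark names, the coverage data, and the tie-breaking by name are illustrative assumptions, and the study's actual parameterizations are not reproduced here.

```python
def total_strategy(coverage):
    """Order benchmarks by how many units each covers, largest first."""
    return sorted(coverage, key=lambda b: (-len(coverage[b]), b))


def additional_strategy(coverage):
    """Repeatedly pick the benchmark that covers the most not-yet-covered units."""
    remaining = dict(coverage)
    covered = set()
    order = []
    while remaining:
        # sorted() makes ties deterministic: max() keeps the first best candidate.
        best = max(sorted(remaining), key=lambda b: len(remaining[b] - covered))
        if covered and not (remaining[best] - covered):
            # No remaining benchmark adds coverage: reset and rank the rest anew.
            covered = set()
            continue
        covered |= remaining.pop(best)
        order.append(best)
    return order


if __name__ == "__main__":
    # Hypothetical coverage: benchmark name -> covered functions.
    coverage = {
        "BenchA": {"f1", "f2", "f3", "f4"},
        "BenchB": {"f1", "f2", "f3"},
        "BenchC": {"f5", "f6"},
        "BenchD": {"f4"},
    }
    print(total_strategy(coverage))       # ['BenchA', 'BenchB', 'BenchC', 'BenchD']
    print(additional_strategy(coverage))  # ['BenchA', 'BenchC', 'BenchB', 'BenchD']
```

The example shows where the strategies diverge: total ranks `BenchB` second because it covers many units, while additional prefers `BenchC` because everything `BenchB` covers was already exercised by `BenchA`.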

    Data-Driven Decisions and Actions in Today’s Software Development

    Today’s software development is all about data: data about the software product itself, about the process and its different stages, about the customers and markets, about the development, the testing, the integration, the deployment, or the runtime aspects in the cloud. We use static and dynamic data of various kinds and quantities to analyze market feedback, feature impact, code quality, architectural design alternatives, or effects of performance optimizations. Development environments are no longer limited to IDEs in a desktop application or the like but span the Internet using live programming environments such as Cloud9 or large-volume repositories such as BitBucket, GitHub, GitLab, or StackOverflow. Software development has become “live” in the cloud, be it the coding, the testing, or the experimentation with different product options on the Internet. The inherent complexity puts a further burden on developers, since they need to stay alert when constantly switching between tasks in different phases. Research has been analyzing the development process, its data, and its stakeholders for decades and is working on various tools that can help developers in their daily tasks to improve the quality of their work and their productivity. In this chapter, we critically reflect on the challenges faced by developers in a typical release cycle, identify inherent problems of the individual phases, and present the current state of the research that can help overcome these issues.

    Deliberate microbenchmarking of software systems

    Software performance faults have severe consequences for users, developers, and companies. One way to unveil performance faults before they manifest in production is performance testing, which ought to be done on every new version of the software, ideally on every commit. However, performance testing faces multiple challenges that inhibit it from being applied early in the development process, on every new commit, and in an automated fashion. In this dissertation, we investigate three challenges of software microbenchmarks, a performance testing technique on unit granularity which is predominantly used for libraries and frameworks. The studied challenges affect the quality aspects (1) runtime, (2) result variability, and (3) performance change detection of microbenchmark executions. The objective is to understand the extent of these challenges in real-world software and to find solutions to address them. To investigate the challenges’ extent, we perform a series of experiments and analyses. We execute benchmarks in bare-metal as well as multiple cloud environments and conduct a large-scale mining study on benchmark configurations. The results show that all three challenges are common: (1) benchmark suite runtimes are often longer than 3 hours; (2) result variability can be extensive, in some cases up to 100%; and (3) benchmarks often only reliably detect large performance changes of 60% or more.
To address the challenges, we devise targeted solutions as well as adapt well-known techniques from other domains for software microbenchmarks: (1) a solution that dynamically stops benchmark executions based on statistics to reduce runtime while maintaining low result variability; (2) a solution to identify unstable benchmarks that does not require execution, based on statically computable source code features and machine learning algorithms; (3) traditional test case prioritization (TCP) techniques to execute benchmarks earlier that detect larger performance changes; and (4) specific execution strategies to detect small performance changes reliably even when executed in unreliable cloud environments. We experimentally evaluate the solutions and techniques on real-world benchmarks and find that they effectively deal with the three challenges. (1) Dynamic reconfiguration drastically reduces runtime, by between 48.4% and 86.0%, without changing the results of 78.8% to 87.6% of the benchmarks, depending on the project and statistic used. (2) The instability prediction model effectively identifies unstable benchmarks when relying on random forest classifiers, with a prediction performance between 0.79 and 0.90 area under the receiver operating characteristic curve (AUC). (3) TCP applied to benchmarks is effective and efficient, with APFD-P values for the best technique ranging from 0.54 to 0.71 and a computational overhead of 11%. (4) Batch testing, i.e., executing the benchmarks of two versions on the same instances interleaved and repeated as well as repeated across instances, reliably detects performance changes of 10% or less, even when using unreliable cloud infrastructure as execution environment.
Overall, this dissertation shows that real-world software microbenchmarks are considerably affected by all three challenges (1) runtime, (2) result variability, and (3) performance change detection; however, deliberate planning and execution strategies effectively reduce their impact.
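
The first solution in the abstract above, dynamically stopping a benchmark based on statistics, can be pictured with a simple stopping rule. The sketch below is one possible rule and an assumption on my part, not necessarily the one used in the dissertation: it stops once the coefficient of variation over a sliding window of recent measurements falls below a threshold. The function name, window size, threshold, and iteration cap are all hypothetical.

```python
import statistics


def run_until_stable(measure, window=5, cv_threshold=0.01, max_iterations=50):
    """Call measure() repeatedly until the last `window` results are stable.

    Stability here means: coefficient of variation (stdev / mean) of the most
    recent `window` measurements is at most `cv_threshold`. The loop gives up
    after `max_iterations` measurements so it always terminates.
    """
    results = []
    for _ in range(max_iterations):
        results.append(measure())
        if len(results) >= window:
            recent = results[-window:]
            cv = statistics.stdev(recent) / statistics.mean(recent)
            if cv <= cv_threshold:
                break  # measurements have settled; stop early
    return results


if __name__ == "__main__":
    # Hypothetical warm-up curve: early iterations are slower, then it settles.
    timings = iter([120.0, 110.0, 105.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0])
    collected = run_until_stable(lambda: next(timings))
    print(len(collected))  # 8
```

With a fixed configuration, all 50 iterations would run regardless; the early exit is where the runtime saving comes from, at the cost of choosing a window and threshold.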

    Continuous Software Performance Assessment: Detecting Performance Problems of Software Libraries on Every Build

    Degradation of software performance can become costly for companies and developers, yet it is hardly assessed continuously. A strategy that would allow continuous performance assessment of software libraries is software microbenchmarking, which faces problems such as excessive execution times and unreliable results that hinder widespread adoption in continuous integration. In my research, I want to develop techniques that allow including software microbenchmarks in continuous integration by utilizing cloud infrastructure and execution time reduction techniques. These will allow assessing performance on every build and therefore catching performance problems before they are released into the wild.

    A domain-specific language for coordinating collaboration

    Software design and architecture has been a focus of research and industry for a couple of decades. A more recent development in software architecture is the inclusion of humans into software systems. A human-centered architecture description language that belongs to the structure-centric paradigm is the Human Architecture Description Language (hADL). Structure-centric languages, and therefore hADL, enable descriptions of detailed, non-restrictive collaboration mechanisms. Nevertheless, instantiating collaborations described in hADL is brittle and verbose. Compared to structure-centric languages, process-centric languages offer better support for modeling and executing collaborative workflows, but lack the flexibility for defining collaborations. A combination of the two is considered superior to either paradigm on its own. This thesis examines a method for specifying valid hADL programs that drastically decreases the programmer's effort. A further aim is to narrow the gap between process-centric and structure-centric languages. The thesis introduces a Domain-Specific Language (DSL) developed with Xtext (a framework for language engineering), which allows developers to specify hADL collaboration instances in a concise way.
The feature set of the DSL includes linkage and monitor primitives of hADL, control flow statements, variables, and an abstraction mechanism. Validity and consistency of DSL programs are ensured by compile-time checks. Only valid programs are transformed into Java/hADL client code. Moreover, development convenience is increased through IDE suggestions and auto-completions. The DSL is evaluated with a simplified Scrum process. The evaluation highlights the effort a developer saves when using the DSL compared to defining collaborations with plain Java. Collaborations of two distinct Scrum scenarios are implemented with the DSL; the required collaboration structures are set up and later torn down. The results show that significantly less mental work (checks done in the developer's head) and fewer Lines of Code (LOC) are necessary to define collaborations with the DSL compared to Java. In the evaluation scenario, the DSL performs 70 automatic validity and consistency checks, and the generated hADL client code is longer than the DSL script by a factor of 7.3.

    (h|g)opper: Performance History Mining and Analysis

    Performance changes of software systems, and especially performance regressions, have a tremendous impact on users of that system. Historical data can help developers to reason about how performance has changed over the course of a software’s lifetime. In this demo paper we present two tools: hopper to mine historical performance metrics based on benchmarks and unit tests, and gopper to analyse the data with respect to performance changes.